Analyzing the Success of the Top 10 Pizza Chains in the U.S.

Tutorial by Katherine Wang and Kaitlin Sim

Why do we want to know what contributes to a pizza chain's success?

Pizza is an iconic staple for Americans and is one of the world’s most popular foods. With such a large variety of pizza chains in America, the best and worst pizza chains are highly debated. Well known chains like Domino’s and Pizza Hut are either loved or hated, but how are they still so successful? In this tutorial, we will investigate different factors that could potentially affect the success of a pizza chain by analyzing the menus and Yelp reviews of the top ten highest revenue pizza chains. We will use revenue as the measurement of success. In order of highest to lowest revenue, these top chains include Domino’s Pizza, Pizza Hut, Little Caesars Pizza, Papa John’s, California Pizza Kitchen, Papa Murphy’s Pizza, Sbarro, Marco’s Pizza, Chuck E. Cheese’s, and Round Table Pizza. We want to only look at these successful chains so that we can compare them and see if they have anything in common that could link to their success.

Part 1: Data Curation, Parsing, and Management

Overview

We looked at several different datasets to analyze what makes a pizza chain successful. We web scraped the top ten pizza chain business data from Pizza Today's 2019 Top 100 Pizza Companies. We also used the Yelp Dataset to look at reviews and ratings of different chains and Datafiniti’s Pizza Restaurants and the Pizza They Sell dataset to look at menu items and prices.

Get Top Pizza Chains and Their Revenue

Let’s first web scrape the top ten pizza chain data from Pizza Today's 2019 Top 100 Pizza Companies. We parsed the data into a pandas DataFrame with the respective column names:
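As a rough illustration of the parsing step, here is a minimal sketch that turns an HTML table into a pandas DataFrame using only the standard library parser. The markup, the column names (`Rank`, `Company`, `Revenue`), and the placeholder rows are assumptions made for this example; they are not the actual structure or values of the Pizza Today page.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_frame(html, revenue_col="Revenue", top_n=10):
    """Parse the first row as the header and keep the top_n data rows."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    df = pd.DataFrame(body, columns=header).head(top_n).copy()
    # Turn strings such as "$1,500" into floats
    df[revenue_col] = (
        df[revenue_col].str.replace(r"[$,]", "", regex=True).astype(float)
    )
    return df


# Placeholder markup with made-up chains and values, for illustration only
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Chain A</td><td>$1,500</td></tr>
  <tr><td>2</td><td>Chain B</td><td>$900</td></tr>
</table>
"""

top = table_to_frame(SAMPLE_HTML)
print(top)
```

In a real run, the HTML string would come from downloading the ranking page instead of `SAMPLE_HTML`, and the column names would need to match whatever the page actually uses.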

Clean and Merge Yelp Business and Review Datasets

Now that we know what the top pizza chains by revenue are, we can begin cleaning and processing the other datasets. In the Yelp Dataset, we will only use yelp_academic_dataset_business.json and yelp_academic_dataset_review.json since we only want to look at review and star rating data.

In yelp_academic_dataset_review.json, there is a field called “business_id” instead of having a field for the business name, so we need to use yelp_academic_dataset_business.json, which also has the “business_id” field, to get the corresponding names of the businesses for each review. Before doing that, we will first clean the business dataset and only keep the pizza businesses. Now, we can merge both datasets into one dataframe.
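The filter-then-merge step described above can be sketched roughly as follows. This is a minimal example on tiny stand-in frames, assuming the business data has `business_id`, `name`, and `categories` columns and the review data has `business_id` and `stars`; the rows themselves are made up.

```python
import pandas as pd


def merge_pizza_reviews(business, review):
    """Keep businesses tagged 'Pizza' and attach their names to reviews."""
    # categories is a comma-separated string and may be missing
    is_pizza = business["categories"].fillna("").str.contains("Pizza", case=False)
    pizza = business.loc[is_pizza, ["business_id", "name"]]
    # An inner join drops reviews of non-pizza businesses
    return review.merge(pizza, on="business_id", how="inner")


# Made-up stand-ins for the two Yelp files
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["Chain A", "Cafe B", "Chain C"],
    "categories": ["Pizza, Restaurants", "Coffee & Tea", None],
})
review = pd.DataFrame({
    "business_id": ["b1", "b2", "b1", "b3"],
    "stars": [1, 5, 4, 3],
})

merged = merge_pizza_reviews(business, review)
print(merged)
```

With the real files, the two frames would probably be loaded with something like `pd.read_json(path, lines=True)`, since each line of the Yelp files is a separate JSON record.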

Since both original datasets are quite large and this code takes a while to run, we will save the merged dataframe as a new csv file called yelp_reviews_Pizza_categories.csv for efficiency.
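One way to express this caching idea is a small helper that only rebuilds the merged data when the CSV does not exist yet. The helper name `load_or_build` and the demo builder are hypothetical, and the sketch writes to a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build):
    """Read csv_path if it exists; otherwise call build() and save the result."""
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    df = build()
    df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    return df


build_calls = []


def build_demo():
    """Stand-in for the slow clean-and-merge step (made-up rows)."""
    build_calls.append(1)
    return pd.DataFrame({"name": ["Chain A", "Chain B"], "stars": [4, 1]})


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "yelp_reviews_Pizza_categories.csv"
    first = load_or_build(cache, build_demo)   # builds and writes the CSV
    second = load_or_build(cache, build_demo)  # reads the CSV back

print(len(build_calls))
```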

Clean Pizza Subset of Yelp Data and Datafiniti Dataset

Now that we have the pizza subset of our merged Yelp data, we can do some more cleaning on the pizza chain data and also clean the Datafiniti dataset. Each row in the Datafiniti dataset represents a pizza from a restaurant and contains information about the business, prices, and more. After filtering out columns in both datasets that we didn’t need, we removed rows that were not related to the top ten pizza chains and standardized the spelling and punctuation of the chains’ names.
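Standardizing the chain names could look something like the sketch below. The matching rule (lowercase, strip punctuation, ignore a trailing "pizza") is an assumption chosen for illustration, not necessarily the rule used for the actual cleaning, and the raw names in the demo are made up.

```python
import re

import pandas as pd

TOP_CHAINS = [
    "Domino's Pizza", "Pizza Hut", "Little Caesars Pizza", "Papa John's",
    "California Pizza Kitchen", "Papa Murphy's Pizza", "Sbarro",
    "Marco's Pizza", "Chuck E. Cheese's", "Round Table Pizza",
]


def name_key(name):
    """Lowercase, drop everything but letters and digits, drop a trailing 'pizza'."""
    key = re.sub(r"[^a-z0-9]", "", str(name).lower())
    return re.sub(r"pizza$", "", key)


CANONICAL = {name_key(chain): chain for chain in TOP_CHAINS}


def keep_top_chains(df, column="name"):
    """Replace spelling variants with the canonical name and drop other rows."""
    out = df.copy()
    out[column] = out[column].map(lambda n: CANONICAL.get(name_key(n)))
    return out.dropna(subset=[column]).reset_index(drop=True)


# Made-up spelling variants, plus one business outside the top ten
raw = pd.DataFrame({"name": [
    "Dominos", "domino’s pizza", "Pizza Hut", "Joe's Pizzeria", "Papa John's Pizza",
]})
cleaned = keep_top_chains(raw)
print(cleaned["name"].tolist())
```

An exact match on the normalized key is deliberately strict: it avoids lumping unrelated businesses together, at the cost of missing variants such as a dropped possessive, which would need their own entries.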

The top pizza chain subset of the Yelp data contain the below fields and each row represents a review:

Here are the fields in the cleaned top pizza chain subset of the Datafiniti dataset:

We noticed that in the Datafiniti dataset, there was a lack of data from Chuck E. Cheese’s. This could be due to the fact that the Datafiniti dataset, which we downloaded from Kaggle, is only a sample of a larger Datafiniti dataset, which is unfortunately not free. Also, for some reason, all of the prices for the Chuck E. Cheese’s pizzas were zero. To resolve these issues, we manually added Chuck E. Cheese’s data to the Datafiniti data using the online ordering menu on Chuck E. Cheese’s’ website.

Part 2: Exploratory Data Analysis

After cleaning and parsing our datasets, we can now begin analyzing our data and creating visualizations.

Yelp Reviews

Let’s first look at the Yelp reviews. We want to analyze this because we want to see if existing opinions have any effect on a pizza chain’s success. For this section, we looked at the review star ratings of each chain and calculated the ratios and counts of each star rating and average star rating of each pizza chain.
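The counts, ratios, and averages mentioned above can be computed along these lines. This is a minimal sketch assuming one row per review with a chain `name` column and a `stars` column; the sample rows are made up.

```python
import pandas as pd


def star_summary(reviews):
    """Return per-chain star counts, star ratios, and average star rating."""
    # Rows are chains, columns are star values, cells are review counts
    counts = pd.crosstab(reviews["name"], reviews["stars"])
    # Divide each row by its total so every chain's ratios sum to 1
    ratios = counts.div(counts.sum(axis=1), axis=0)
    average = reviews.groupby("name")["stars"].mean()
    return counts, ratios, average


reviews = pd.DataFrame({
    "name": ["Chain A"] * 4 + ["Chain B"] * 2,
    "stars": [1, 1, 5, 3, 4, 5],
})
counts, ratios, average = star_summary(reviews)
print(ratios)
print(average)
```

Ratios make chains with very different review volumes comparable, which is why both a frequency plot and a ratio plot are useful.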

Prepare Yelp Review data for plot creation

Plot: Star Rating Frequencies by Chain

Plot: Star Rating Ratios by Chain

Combined Analysis of Star Rating Frequencies by Chain and Star Rating Ratios by Chain

In both plots, it is clear that for most pizza chains, the most common star rating is one star, even though these pizza chains have the highest revenue. This could be due to the fact that people tend to only write reviews when they have had a particularly good or bad experience. However, the most common star rating is four stars for California Pizza Kitchen and five stars for Papa Murphy’s Pizza and Marco’s Pizza. Thus, it is possible that these three chains have high revenues partly because of these good ratings, since potential customers who see that a business has good reviews are more inclined to visit. It is also interesting to note from the first plot (Frequency of Review Star Ratings by Chain) that there is generally a decrease in the number of reviews as revenue decreases. It’s possible that the more reviews a chain has, the more customers it has, which leads to higher revenue. We will need to dig deeper to see if review star ratings actually have an effect on a chain’s success.

Plot: Average Star Rating vs Revenue

Analysis of Average Star Rating (Out of 5 Stars) vs Revenue Plot

The range for the overall star ratings from Yelp for the top pizza chains is [1.97, 3.30]. From the graph, it seems like the first four and the last three pizza chains of the top ten are rated relatively low compared to the middle three. This might make you wonder, why are the top two (out of the top ten) pizza chains, Domino's and Pizza Hut, rated so low? Why are the middle three pizza chains, Papa Murphy's Pizza, California Pizza Kitchen, and Marco's Pizza, rated so high?

There are some factors one has to consider about the nature of Yelp reviews in order to analyze why these businesses are so successful regardless of their Yelp reviews. Yelp reviews mostly describe particularly good or bad experiences, and there are fewer reviews from customers who were merely satisfied. A person is more likely to go out of their way to share their experience with a particular restaurant if they have an overwhelmingly positive or negative personal experience with the food, atmosphere, or staff of the restaurant. There are other factors as well, such as faulty reviews, that can be found here and that could contribute to the results found.

The most successful pizza chain in terms of revenue is Domino's, and it has the most units in the country. But why is Domino's average star rating only 2.22 out of 5 stars on Yelp? With Domino's being such a well-known restaurant selling millions of pizzas a day, most people would most likely have a relatively good or average experience with Domino's that wouldn't incite them to write a Yelp review for the restaurant. However, if a customer has a relatively negative experience with the quality of the food or the staff at Domino's, or at a well-known restaurant in general, then they are more likely to write a review about their experience, because these top businesses are expected to have high standards for their restaurants and uphold a good reputation, while a lesser-known restaurant does not face these same expectations.

This still leaves the question of why Papa Murphy's Pizza, California Pizza Kitchen, and Marco's Pizza have such high average star ratings on Yelp. After more research, it appears that each of these pizzerias has something special that makes it stand out from the average fast food pizza chain. Papa Murphy's is famous for its take-and-bake pizzas and also for its huge variety of items and fresh quality ingredients. California Pizza Kitchen is famous for its innovative pizza creations such as its Thai Chicken Pizza. Finally, Marco's Pizza describes itself as the only national pizza chain in the US that was founded by a native Italian; this pizzeria is known for its quality and atmosphere and has had two award-winning pizzas.

Locations of Top Pizza Chains

We will plot all of the locations of the top pizza chains in the US.

A key of the colors:

Analyzing Chains in Major Cities

Locations of Top Pizza Chains in Major Cities

We also plotted the top pizza chain locations in the top twenty most populated cities in the US (list of cities taken from World Population Review).

When looking at the locations in the most populated cities in the US compared to all of the locations in the US as a whole, it is clear that the majority of the pizzerias are not in major cities. However, since the major cities are the most populated in the country, locations in these cities are more likely to bring in more customers and more revenue.
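The share of each chain's locations that fall in major cities can be estimated with a sketch like the one below. It assumes a locations table with `name`, `city`, and `state` columns (hypothetical names), matches on city and state together so that same-named towns in other states are not counted, and uses a two-city list purely for illustration instead of the full top-twenty list.

```python
import pandas as pd


def major_city_share(locations, major_cities):
    """Percentage of each chain's locations that are in major_cities."""
    pairs = set(major_cities)
    in_major = [
        (city, state) in pairs
        for city, state in zip(locations["city"], locations["state"])
    ]
    flagged = locations.assign(in_major=in_major)
    # The mean of a boolean column is the fraction of True values
    return flagged.groupby("name")["in_major"].mean() * 100


# Illustrative two-city list and made-up location rows
major_cities = [("Houston", "TX"), ("Chicago", "IL")]
locations = pd.DataFrame({
    "name": ["Chain A", "Chain A", "Chain A", "Chain A", "Chain B"],
    "city": ["Houston", "Chicago", "Springfield", "Springfield", "Springfield"],
    "state": ["TX", "IL", "IL", "MO", "IL"],
})
share = major_city_share(locations, major_cities)
print(share)
```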

Plot: Percentage of Locations of Top Pizza Chains in Major Cities

Analysis of Percentage of Locations in Major Cities Plot

As you go from left to right (highest to lowest revenue) across the chains, you'll notice that there is a general decrease in percentage of locations that are in major cities (vs non major cities in the US), except for a few outliers. Perhaps, the more revenue a pizza chain has, the more locations they are able to open in order to make even more money.

The three pizza chains that stand out from the trend are Papa John's, California Pizza Kitchen, and Chuck E. Cheese's. One speculation about why Papa John's has so many locations but less revenue than the top three pizza chains is that, despite its history of success, more recent scandals and controversial remarks earned Papa John's and its CEO a bad reputation, especially in 2017-2018 (only a year or two before this 2019 revenue data). The information about Papa John's scandals can be found here.

One speculation about why California Pizza Kitchen has so many locations in major cities is that its headquarters is in Los Angeles, CA, which is the second most populated city in the US. A chain is expected to have more locations in the same city as its headquarters.

One speculation about why Chuck E. Cheese's has so many locations in major cities but lower revenue (from their pizza distributor Peter Piper Pizza) is that Chuck E. Cheese's main business is children's entertainment, and pizza is just a food service that it provides at its locations. Therefore, Chuck E. Cheese's would most likely be in major cities, where there are more children and families as its target audience.

Analyzing Prices

Now we will analyze menu item prices for each pizza chain.

Plot: Price Range vs Revenue

Analysis of Menu Item Price Distribution by Pizza Chain Plot

From the box plot, we can see the general range of prices for the menu items for the top pizza chains. One attribute of the box plots that we notice is that the majority of the first quartiles are greater than or equal to $10; thus, at most of these chains, at least three quarters of the menu items are priced at or above $10. The three chains that have a first quartile below $10 are Little Caesars Pizza, Sbarro, and Round Table Pizza.
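The quartiles that a box plot draws can also be computed directly, which makes it easy to list the chains whose first quartile falls under a threshold. This is a minimal sketch assuming one row per menu item with `name` and `price` columns (hypothetical names); the prices are made up.

```python
import pandas as pd


def price_quartiles(menu):
    """First quartile, median, and third quartile of item prices per chain."""
    q = menu.groupby("name")["price"].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    return q


menu = pd.DataFrame({
    "name": ["Chain A"] * 4 + ["Chain B"] * 3,
    "price": [10.0, 12.0, 14.0, 16.0, 5.0, 5.0, 9.0],
})
quartiles = price_quartiles(menu)
# Chains where at least a quarter of the items cost less than 10
below_10 = quartiles.index[quartiles["q1"] < 10].tolist()
print(quartiles)
print(below_10)
```

pandas interpolates linearly between data points by default, so the quartiles may differ slightly from those drawn by a plotting library that uses another convention.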

Let's look into how these three restaurants are still able to make a profit despite their low selling prices. Some possible reasons why these restaurants can be so cheap and still bring in high revenue are discounted ingredients, careful portioning, convenience in certain locations, and other tactics.

Part 3: Machine Learning

Now that we've analyzed star ratings, location, and prices, let's try to see if there exists a predictive relationship between median menu item price, average star rating, and total revenue using linear regression. For the regression, we used median menu item price since the median is a more robust measure of central tendency than the mean. As you can see in the above box plot, the price distributions of many chains contain outliers.
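For readers who want to see what the regression step involves, here is a minimal sketch of a one-variable least-squares fit and its r-squared value using NumPy. It is not necessarily the library used to produce the plots in this tutorial, and the numbers are made up.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus r-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)  # variation the line fails to explain
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return slope, intercept, 1 - ss_res / ss_tot


# A perfect line and a noisy one, both with made-up values
perfect = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
noisy = fit_line([1, 2, 3, 4], [2, 1, 4, 3])
print(perfect)
print(noisy)
```

An r-squared near 1 means the line explains most of the variation in y, while a value near 0 means it explains almost none. With only ten chains, a few unusual points can move the fitted line a lot.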

Linear Regression for Median Item Price vs Revenue

Analysis of Linear Regression for Median Item Price vs Revenue

The regression line shows that there is a slight positive correlation between median item price and revenue. However, the points in the graph do not fit the line very well, and the r-squared value (0.024) is quite low. This means that there isn't truly a predictive relationship between median item price on a menu of a chain and revenue. This could be because of the lower six pizza chains in the top ten, which together have a wide range of median prices. This means that there must be other factors besides price that contribute to the lower six chains' success, specifically Sbarro and Chuck E. Cheese's since they are the furthest from the regression line.
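To check which points sit furthest from a regression line, the residuals (observed value minus fitted value) can be ranked by absolute size. The sketch below assumes NumPy and pandas and uses made-up values with placeholder labels, not the tutorial's data.

```python
import numpy as np
import pandas as pd


def residuals_by_size(x, y, labels):
    """Residuals from a least-squares line, largest absolute value first."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = pd.Series(y - (slope * x + intercept), index=labels)
    order = residuals.abs().sort_values(ascending=False).index
    return residuals.reindex(order)


ranked = residuals_by_size(
    x=[1, 2, 3, 4, 5],
    y=[1, 2, 10, 4, 5],
    labels=["A", "B", "C", "D", "E"],
)
print(ranked)
```

A positive residual means the point lies above the line, and a negative one means it lies below.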

Linear Regression for Median Item Price vs Average Star Rating

Analysis of Linear Regression for Median Item Price vs Average Star Rating

The regression line on the plot shows that there is a slight positive correlation between median menu item price and average star rating of a chain. However, the r-squared value (0.035) is quite low, meaning there isn't truly a correlation between these two variables. We believe the slope of the regression line is due to the Sbarro and Chuck E. Cheese's points since they are the extrema while all of the other points are closer to the center of the graph. The lack of a predictive relationship between median price and average star rating means that price does not have as much of an effect on a reviewer's star rating of a chain as we thought. There must be other factors that go into a reviewer's star rating of a pizza chain.

Linear Regression for Average Star Rating vs Revenue

Analysis of Linear Regression for Average Star Rating vs Revenue

Based on the slope of the regression line and the r-squared value (0.429), it seems that there is a slight predictive relationship or correlation between average star rating of a chain and revenue. As revenue increases, the predicted average star rating decreases. It is interesting how the more successful a restaurant is in terms of revenue, the worse its Yelp review ratings are. We had originally hypothesized the opposite: the more successful a restaurant is, the higher its Yelp star rating should be. As we've mentioned before, this could make sense given how Yelp reviews work.

Most people would most likely have a relatively good or average experience with a bigger chain that wouldn't incite them to write a Yelp review for the restaurant. However, if a customer has a relatively negative experience with the quality of the food or the staff at a bigger chain or a well-known restaurant, then they are more likely to write a review about their experience, because these top businesses are expected to have higher standards and uphold a better reputation than a lesser-known restaurant.

Final Thoughts

Based on our analysis, we found that menu item prices and locations could potentially have an effect on a pizza chain’s success. In some cases, such as for Little Caesars Pizza and Sbarro, having low menu item prices could possibly help the chain generate more revenue. It's also possible that the more major city locations a pizza chain has, the more revenue it can generate. We also found that the Yelp reviews do not accurately reflect which businesses are successful and that factors other than price go into a Yelp review rating. There are many factors that can make a pizza chain successful, and to know the full extent of these factors would take much more than a single tutorial and a couple of datasets. For further analysis on this topic, it’d be interesting to analyze specific menu items, quality of ingredients, partnering retailers, and customer demographics. In our tutorial, we did not cover these attributes due to the lack of (free) data available.